Improving SMT with Morphology Knowledge for Baltic Languages
Abstract
In recent years, several machine translation systems have been built for the Baltic languages. Besides the Google and Microsoft machine translation engines and research experiments with statistical MT for Latvian [1] and Lithuanian, both English-Latvian [2] and English-Lithuanian [3] rule-based MT systems are available. Latvian and Lithuanian are both morphologically rich languages with relatively free word order. Combined with the limited availability of parallel corpora for these languages, this poses a data sparseness problem for phrase-based SMT. This research is part of a project to build the best general-purpose phrase-based SMT system using publicly available and proprietary corpora and tools. During the project we added language-specific knowledge to assess how much it could improve translation quality. This paper reports on the implementation, as well as the automatic and human evaluation, of English-Latvian and Lithuanian-English statistical machine translation systems. The human evaluation results show that integrating morphology knowledge into SMT significantly improves translation quality compared to the baseline SMT.
Similar resources
Improving SMT for Baltic Languages with Factored Models
This paper reports on the implementation and evaluation of English-Latvian and Lithuanian-English statistical machine translation systems. It also gives a brief introduction to the project scope: the Baltic languages, prior implementations of MT, and the evaluation of MT systems. In this paper we report on the results of both automatic and human evaluation. Results of human evaluation show that factored SMT gives si...
SMT of Latvian, Lithuanian and Estonian Languages: a Comparative Study
This paper is an attempt to discover the main challenges in working with Baltic and Estonian languages, and to identify the most significant sources of errors generated by an SMT system trained on large-vocabulary parallel corpora from the legislative domain. The immense difference between Latvian/Lithuanian and Estonian causes a set of non-equivalent difficulties which we classify and com...
Real-world challenges in application of MT for localization: the Baltic case
In this paper we share our experience of implementing machine translation in localization into the relatively small languages of the three Baltic countries: Latvian, Lithuanian, and Estonian. We describe our approach to improving terminology translation and consistency by preprocessing the source text and performing term integration. We present results of a formal evaluation of MT impact on t...
Modelling Linguistic Phenomena with Unsupervised Morphology for Improving Statistical Machine Translation
This work studies an ascetic approach to statistical machine translation. We assume that only a small parallel corpus is available, and that no other mono- or bilingual corpora or linguistic tools can be used, which is the case for many resource-scarce languages. Our aim is to find out how a baseline SMT system can be improved under this condition. In such a case one of the natural choices is to use u...
Using POS Information for SMT into Morphologically Rich Languages
When translating from languages with hardly any inflectional morphology, like English, into morphologically rich languages, the English word forms often do not contain enough information for producing the correct full form in the target language. We investigate methods for improving the quality of such translations by making use of part-of-speech information and maximum entropy modeling. Results fo...